WHITE WINE ANALYSIS by MANUEL

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.0             0.27        0.36           20.7     0.045
## 2 2           6.3             0.30        0.34            1.6     0.049
## 3 3           8.1             0.28        0.40            6.9     0.050
## 4 4           7.2             0.23        0.32            8.5     0.058
## 5 5           7.2             0.23        0.32            8.5     0.058
## 6 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6

There are 4898 white wines in our dataset with 12 features for EDA.

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

Univariate Commented Plots Section

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

The quality looks rather normally distributed, no transformation needed.

We can see here that the majority of white wines have a quality rating between 5 and 7; there are very few counts at the outer boundaries of the histogram.

The fixed acidity looks rather normally distributed, maybe a little skewed to the right. I observed some outliers below 4 or beyond 11, which are not included in the histogram. I reduced the binwidth for a better view. For now, we keep the skewness in mind as it might be worth testing log10 transformation later on.

Looking at the histogram for the volatile acidity, this appears even more right skewed than the fixed acidity. I set the histogram limits from 0.1 to 0.7, knowing that there are wines with more volatile acidity in our population. The second chart shows the log10 transformed version, which comes much closer to a normal distribution, we might be able to use that later on.
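The effect of the log10 transformation on skewness can be illustrated with a small sketch. It uses synthetic, right-skewed data as a stand-in (an assumption, not the wine data itself), and the `skewness` helper is a hypothetical name defined here for illustration:

```r
# Synthetic, right-skewed stand-in for volatile acidity (NOT the wine data)
set.seed(42)
x <- rlnorm(5000, meanlog = log(0.26), sdlog = 0.35)

# Moment-based sample skewness; values near 0 indicate a symmetric distribution
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw   <- skewness(x)         # clearly positive for right-skewed data
skew_log10 <- skewness(log10(x))  # close to 0 after the transformation

print(c(raw = skew_raw, log10 = skew_log10))
```

Applied to the real data, the same helper could be called on `ww$volatile.acidity` and `log10(ww$volatile.acidity)` to put a number on what the two histograms show.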

Citric acid looks rather normally distributed, even though we have surprisingly many wines at 0.48/0.49 and 0.73/0.74.

Residual sugar is clearly skewed to the right, many wines are at a peak below 2.5. The boxplot over the scattered points shows the high number of outliers in the fourth quartile. The log10 doesn’t look like a bell curve, there is a drop in the middle, let’s see how we can deal with this later on.

Chlorides are a little right-skewed, the log10 transformation puts it in a rather normal distribution.

Free sulfur dioxide looks rather normally distributed. Again, I set the xlims.

Total sulfur dioxide also looks rather normally distributed. Limits for x were modified.

Density is interesting, distribution rather normal but it seems that there are high counts followed by low counts followed by high counts when we move along the x axis. Maybe something due to measuring with different lab equipment.

The pH value looks clearly normally distributed. No x limits this time.

Sulphates are again right-skewed; the log10 transformation brings them close to a normal distribution. No x limits this time.

Alcohol is to some extent a special case. It’s not really normally distributed, and transformations with log10 or sqrt don’t change that; other transformations might be needed at a later stage.

Univariate Analysis

What is the structure of your dataset?

There are 4898 white wines in our dataset with 12 numerical features:

Input variables (based on physicochemical tests):

  1 - fixed acidity (tartaric acid - g / dm^3)
  2 - volatile acidity (acetic acid - g / dm^3)
  3 - citric acid (g / dm^3)
  4 - residual sugar (g / dm^3)
  5 - chlorides (sodium chloride - g / dm^3)
  6 - free sulfur dioxide (mg / dm^3)
  7 - total sulfur dioxide (mg / dm^3)
  8 - density (g / cm^3)
  9 - pH
  10 - sulphates (potassium sulphate - g / dm^3)
  11 - alcohol (% by volume)

Output variable (based on sensory data): 12 - quality (score between 0 and 10)

Key statistics for these variables are listed below:

  1 - fixed acidity (range: 3.80 - 14.20, mean: 6.86)
  2 - volatile acidity (range: 0.08 - 1.10, mean: 0.28)
  3 - citric acid (range: 0.00 - 1.66, mean: 0.33)
  4 - residual sugar (range: 0.60 - 65.80, mean: 6.39)
  5 - chlorides (range: 0.01 - 0.35, mean: 0.05)
  6 - free sulfur dioxide (range: 2.00 - 289.00, mean: 35.31)
  7 - total sulfur dioxide (range: 9.00 - 440.00, mean: 138.40)
  8 - density (range: 0.99 - 1.04, mean: 0.99)
  9 - pH (range: 2.72 - 3.82, mean: 3.19)
  10 - sulphates (range: 0.22 - 1.08, mean: 0.49)
  11 - alcohol (range: 8.00 - 14.20, mean: 10.51)
  12 - quality (range: 3.00 - 9.00, mean: 5.88)

Details for quality, our dependent variable aka the main feature of interest in our dataset for my analysis:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

To explore what contributes to the quality derived from sensory tests, I will have a closer look at all other variables. I may learn something new along the way, as I’m no white wine expert beyond knowing what alcohol and pH mean.

Other observations:

We can see here that the majority of white wines have a quality rating between 5 and 7; there are very few counts at the outer boundaries.

Binwidth modifications and xlimit reductions were usually helpful for the plots, many variables are more or less skewed, straining the plot limits.

I made some log10 transformations to get a better plot with regards to the normal distribution, specifically for volatile acidity, residual sugar, chlorides and sulphates. This information will be useful for later analysis as we should try to work with normal distributions to avoid misleading results.

Residual sugar is clearly skewed to the right, many wines are at a peak below 2.5. The log10 doesn’t look like a bell curve, there is a drop in the middle, let’s see how we can deal with this later on.

Density is interesting, distribution rather normal but it seems that there are high counts followed by low counts followed by high counts when we move along the x axis. Maybe something due to measuring.

Alcohol is to some extent a special case. It’s not really normally distributed, but transformations with log10 or sqrt don’t change that, might be more depth of transformations needed at a later stage, maybe we can also work with ratios.

Bivariate Commented Plots Section

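That’s helpful. We’re focussing on quality, so here are some observations:

  • lower volatile acidity seems to have a slightly positive influence
  • same applies for chlorides and total sulfur dioxide
  • lower density seems to have a rather positive influence on quality
  • higher alcohol seems to have a strongly positive impact on quality

Amongst the others, following observations are interesting:

  • density has a strong negative correlation with alcohol and a strong positive relation with residual sugar, alcohol has a strong negative correlation with residual sugar. These are the strongest effects observed
  • pH value is negatively related to fixed acidity
  • levels of free sulfur dioxide and total sulfur dioxide are positively correlated with each other and positively correlated with density
  • fixed acidity is not related to volatile acidity
  • chlorides are positively correlated with density, but negatively with alcohol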

We need to keep in mind the rule of thumb of small relatedness starting with r >= 0.3, which we can show here for some cases. Yet, this also depends on the case number, we have plenty of cases in our dataset and will test significance in regression models.
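To make the point about case numbers concrete, here is a small sketch of how little correlation is needed for statistical significance with 4898 observations. It assumes a Pearson correlation and a two-sided test at alpha = 0.05; both are assumptions for illustration, not choices stated in the analysis:

```r
# Assumptions: Pearson correlation, two-sided test, alpha = 0.05
n     <- 4898                      # number of wines in the dataset
alpha <- 0.05

# Critical t value for n - 2 degrees of freedom
t_crit <- qt(1 - alpha / 2, df = n - 2)

# Smallest |r| that would be significant at this sample size
r_crit <- t_crit / sqrt(n - 2 + t_crit^2)

# t statistic belonging to the rule-of-thumb threshold r = 0.3
r      <- 0.3
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

print(round(c(r_crit = r_crit, t_stat = t_stat), 3))
```

With this many cases even a very weak correlation passes a significance test, which is why the r >= 0.3 rule of thumb is the more useful filter for practical relevance here.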

With the univariate histograms and correlation plot we’re good to move on with more bivariate plots.

##     item group1 vars    n      mean         sd median   trimmed      mad
## X11    1      3    1   20 0.3332500 0.14082721   0.26 0.3165625 0.088956
## X12    2      4    1  163 0.3812270 0.17346335   0.32 0.3605725 0.111195
## X13    3      5    1 1457 0.3020110 0.10006628   0.28 0.2917138 0.074130
## X14    4      6    1 2198 0.2605641 0.08814208   0.25 0.2517670 0.074130
## X15    5      7    1  880 0.2627670 0.09110644   0.25 0.2554901 0.088956
## X16    6      8    1  175 0.2774000 0.10802942   0.26 0.2667730 0.103782
## X17    7      9    1    5 0.2980000 0.05761944   0.27 0.2980000 0.044478
##      min   max range      skew   kurtosis          se
## X11 0.17 0.640 0.470 0.8810200 -0.6840048 0.031489921
## X12 0.11 1.100 0.990 1.3750398  2.1528222 0.013586698
## X13 0.10 0.905 0.805 1.4260309  3.7511289 0.002621549
## X14 0.08 0.965 0.885 1.5315969  5.0307374 0.001880050
## X15 0.08 0.760 0.680 0.8086661  0.9838638 0.003071198
## X16 0.12 0.660 0.540 0.9745983  0.8616899 0.008166257
## X17 0.24 0.360 0.120 0.2140342 -2.2094469 0.025768197

The boxplot underlines the slightly negative correlation between volatile acidity and quality; I converted quality to a factor to get the desired view. Looking at the means (‘x’), the highest appear at quality levels 3 and 4. Here, the third quartiles are spread out quite a bit. We can also see a rather large number of volatile acidity outliers at quality levels 5 and 6.
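As a quick consistency check, the per-quality means in the table above can be recombined into the overall mean of volatile acidity. The sketch below only uses the group sizes and means printed in that table; the variable names are hypothetical:

```r
# Group sizes and mean volatile acidity per quality level (3 to 9),
# copied from the grouped summary table above
n_by_quality    <- c(20, 163, 1457, 2198, 880, 175, 5)
mean_by_quality <- c(0.3332500, 0.3812270, 0.3020110, 0.2605641,
                     0.2627670, 0.2774000, 0.2980000)
names(n_by_quality)    <- as.character(3:9)
names(mean_by_quality) <- as.character(3:9)

# Size-weighted mean of the group means equals the overall mean
overall_mean <- weighted.mean(mean_by_quality, n_by_quality)

# Share of wines rated 5, 6 or 7
share_5_to_7 <- sum(n_by_quality[c("5", "6", "7")]) / sum(n_by_quality)

print(round(c(overall_mean = overall_mean, share_5_to_7 = share_5_to_7), 4))
```

The weighted mean agrees with the overall mean of 0.2782 reported in the summary. It also shows how thin the data is at quality levels 3 and 9, so the high means at the low end rest on few wines.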

For chlorides, the correlation coefficient with quality was slightly negative. The plot confirms this: medium chlorides are centered at medium quality (level 6), higher chlorides at quality level 5 and lower chlorides at quality level 7.

##     item group1 vars    n     mean        sd median  trimmed     mad min
## X11    1      3    1   20 170.6000 107.75833  159.5 159.5938 79.3191  19
## X12    2      4    1  163 125.2791  52.75377  117.0 124.4885 62.2692  10
## X13    3      5    1 1457 150.9046  44.08619  151.0 151.3569 45.9606   9
## X14    4      6    1 2198 137.0473  41.28622  132.0 135.3599 41.5128  18
## X15    5      7    1  880 125.1148  32.74298  122.0 123.3871 32.6172  34
## X16    6      8    1  175 126.1657  33.00633  122.0 124.2908 35.5824  59
## X17    7      9    1    5 116.0000  19.82423  119.0 116.0000  8.8956  85
##       max range        skew     kurtosis         se
## X11 440.0 421.0  0.81071555  0.095230530 24.0954953
## X12 272.0 262.0  0.20641772 -0.691207090  4.1319941
## X13 344.0 335.0 -0.03170024  0.073808470  1.1549755
## X14 294.0 276.0  0.35591447 -0.149547505  0.8806255
## X15 229.0 195.0  0.50036525  0.273050239  1.1037657
## X16 212.5 153.5  0.53693167 -0.002396868  2.4950440
## X17 139.0  54.0 -0.43928052 -1.436221665  8.8656641

The boxplot using factored quality and total sulfur dioxide shows a slightly negative relation. However, quality level 3 shows some outliers with high total sulfur dioxide levels. Removing these shows a rather level picture, so that the relation between quality and total sulfur dioxide does not show a clear direction.

The scatterplot of quality and density confirms the negative relation. The line shows the density mean per quality, and is going towards lower density values, the higher the wine quality becomes.

The scatterplot of quality and alcohol confirms the positive relation. The line shows the alcohol mean per quality, and is going towards higher alcohol values, the higher the wine quality becomes. We need to watch out as we have most data points at quality levels 5-7, so should not draw the conclusion that extremely high alcohol levels will automatically lead to better wine quality ratings.

An interesting point is that density and alcohol have opposite effects on quality. Both are related to residual sugar, so we will have a look at that.

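Here, our suspicion of a relation between some of our independent variables is confirmed. We can observe the following based on scatter plots with a linear regression smoother (purple):

  • Density increases with lower alcohol levels
  • Density increases with higher residual sugar levels
  • Alcohol decreases with higher residual sugar levels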

The smoother lines confirm the correlation coefficients. This example also confirms that if we want to run a stellar regression analysis of what impacts white wine quality, we need to be aware that the independent variables are related with one another. Without me being the wine expert, it might be the case that the level of residual sugar in a wine influences both density and alcohol and therefore indirectly the quality, even though sugar does not show highly significant relations to wine quality in our correlation matrix.

One last scatterplot allows the conclusion that pH values decrease when fixed acidity increases. The mean line especially between pH 3.0 - 3.4 goes rather smooth, shows some volatility below or beyond that interval.

Bivariate Analysis Summary

Box-, Scatter-, Line- and Smootherplots

We could confirm most selected correlations with the plots as described below each plot. In one case, quality vs. total sulfur dioxide, the plot incl. outliers confirms the correlation, but would probably not do so when removing the outliers. This showcases that the correlation matrix should not be viewed as the single source of truth for further analysis.

Multicollinearity

The observations between density, alcohol and residual sugar indicate that some of the independent variables are interlinked. One could look at some initial R-squared values between quality and other variables here. But to avoid misleading conclusions, I will do that in the multivariate section rather than here in the bivariate analysis.

Next steps

It will be interesting to enrich some of the plots with a third variable in the next section. Also, overall regression models will be developed to gain a high-level picture, yet without running many model fit checks, e.g. against multicollinearity, as we could do with a VIF check in Python.
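A variance inflation factor (VIF) check does not need a dedicated package; a minimal base-R sketch is shown below. It runs on synthetic data whose dependency structure is an assumption, loosely imitating the sugar/alcohol/density pattern seen above. `vif_base` and the `toy` data frame are hypothetical names, not part of the original analysis:

```r
# Synthetic predictors (NOT the wine data): density is built from sugar and
# alcohol, ph is independent of the rest
set.seed(1)
n       <- 500
sugar   <- rnorm(n)
alcohol <- -0.5 * sugar + rnorm(n, sd = 0.8)
density <- 0.8 * sugar - 0.6 * alcohol + rnorm(n, sd = 0.2)
ph      <- rnorm(n)
toy     <- data.frame(sugar, alcohol, density, ph)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all remaining predictors
vif_base <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    fit    <- lm(reformulate(others, response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_base(toy)
print(round(vifs, 2))
```

On the wine data the same function could be applied to the predictor columns of `ww`. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity.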

Multivariate Commented Plots Section

A closer look at quality vs. density supports our hypothesis. The lines by factored quality all indicate negative relations, which lets us conclude that the level of alcohol declines with increasing density.

The two plots above visualize the negative relation between density and alcohol, basically no matter which quality we are looking at. The facet wrapper shows that the relation is valid for many quality levels, the highest level 9 only has a small number of data points.

Our multicollinear relation between residual sugar, density and alcohol is indicated here. We can see the positive relation between residual sugar and density, yet it appears that the alcohol level decreases with increasing density.

Interesting is the observation that the left side of the plot is rather light blue, meaning that also wines with high residual sugar might have high alcohol levels. We can clearly see here that density is more negatively related to alcohol than residual sugar is to alcohol.

Here is a nice representation that shows the relation of pH level and fixed acidity is not influencing the quality level.

Multivariate Plot Analysis Summary

Our plots confirmed the main results from the bivariate section. Now let’s look in more depth into the data by building regression models. For these, I will first run a model with the plain variables “as they are” and then come back to the transformations that were introduced in the univariate chapter to use bell-curve like distributions.

Multivariate Regression Commented Analysis

## 
## Calls:
## m1: lm(formula = I(quality) ~ density, data = ww)
## m2: lm(formula = I(quality) ~ density + alcohol, data = ww)
## m3: lm(formula = I(quality) ~ density + alcohol + residual.sugar, 
##     data = ww)
## m4: lm(formula = I(quality) ~ density + alcohol + residual.sugar + 
##     volatile.acidity, data = ww)
## m5: lm(formula = I(quality) ~ density + alcohol + residual.sugar + 
##     volatile.acidity + pH, data = ww)
## m6: lm(formula = I(quality) ~ density + alcohol + residual.sugar + 
##     volatile.acidity + pH + chlorides, data = ww)
## 
## ========================================================================================================
##                          m1            m2            m3            m4            m5            m6       
## --------------------------------------------------------------------------------------------------------
##   (Intercept)          96.277***    -22.492***     90.313***     74.225***     97.650***     96.932***  
##                        (4.003)       (6.165)      (12.374)      (11.977)      (12.392)      (12.436)    
##   density             -90.942***     24.728***    -87.886***    -71.546***    -96.535***    -95.761***  
##                        (4.027)       (6.079)      (12.317)      (11.923)      (12.404)      (12.455)    
##   alcohol                             0.360***      0.246***      0.286***      0.253***      0.251***  
##                                      (0.015)       (0.018)       (0.018)       (0.018)       (0.019)    
##   residual.sugar                                    0.053***      0.052***      0.064***      0.064***  
##                                                    (0.005)       (0.005)       (0.005)       (0.005)    
##   volatile.acidity                                               -2.059***     -2.024***     -2.016***  
##                                                                  (0.109)       (0.109)       (0.109)    
##   pH                                                                            0.528***      0.524***  
##                                                                                (0.076)       (0.077)    
##   chlorides                                                                                  -0.373     
##                                                                                              (0.539)    
## --------------------------------------------------------------------------------------------------------
##   R-squared             0.094         0.192         0.210         0.264         0.271         0.271     
##   adj. R-squared        0.094         0.192         0.210         0.263         0.270         0.270     
##   sigma                 0.843         0.796         0.787         0.760         0.757         0.757     
##   F                   509.911       583.290       434.085       438.646       363.847       303.254     
##   p                     0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood    -6111.983     -5831.127     -5776.812     -5604.126     -5580.287     -5580.047     
##   Deviance           3478.689      3101.773      3033.737      2827.187      2799.800      2799.526     
##   AIC               12229.967     11670.255     11563.624     11220.251     11174.574     11176.094     
##   BIC               12249.456     11696.241     11596.107     11259.231     11220.050     11228.067     
##   N                  4898          4898          4898          4898          4898          4898         
## ========================================================================================================

Based on our findings before, I have developed an approach with six models.

Model 1: Density has a highly significant negative influence on the quality rating. It can explain about 9.4% of the variance of quality level ratings. The denser the wine, the worse the quality is rated.

Model 2: Alcohol has a highly significant positive influence on the quality rating. Adding it explains about another 9.8% of the variance in quality level ratings (R-squared rises from 0.094 to 0.192). Its coefficient is much smaller than the one of density; this can be explained by the different measurement scales: density varies over a very small range, so a change of one full unit in density implies a very large change in the predicted quality. Interestingly, density in model 2 shows a positive relation to quality, opposite to all other models.

The more alcohol, the better the wine quality is rated.
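Because density and alcohol live on very different scales, their raw coefficients are hard to compare. One rough way to put them on a comparable footing is to multiply each coefficient by the interquartile range of its variable. The sketch below uses the density coefficient from model 1 and the alcohol coefficient from model 2 together with the quartiles from the summary table; since the coefficients come from different models, this is only a scale illustration, not a like-for-like comparison:

```r
# Coefficients from the regression table (density: m1, alcohol: m2)
coef_density <- -90.942
coef_alcohol <-   0.360

# Interquartile ranges from the summary table (3rd Qu. minus 1st Qu.)
iqr_density <- 0.9961 - 0.9917   # 0.0044 g / cm^3
iqr_alcohol <- 11.40  - 9.50     # 1.9 % by volume

# Change in predicted quality when moving from the 1st to the 3rd quartile
effect_density <- coef_density * iqr_density
effect_alcohol <- coef_alcohol * iqr_alcohol

print(round(c(density = effect_density, alcohol = effect_alcohol), 3))
```

On this scale, a typical move in density shifts the predicted quality by about 0.4 points and a typical move in alcohol by about 0.7 points, so the huge density coefficient mostly reflects the narrow range of density values.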

Model 3: Residual sugar also has a highly significant positive influence on the quality ratings. The more residual sugar, the better the wine quality is rated. The comprehensive regression model rejects the initial - bivariate correlation-based - hypothesis that the effect of residual sugar on the quality is negative; here the relation is positive.

Model 4: Volatile acidity has a highly significant negative influence on the quality ratings. The higher the volatile acidity is, the worse the wine quality is rated. The comprehensive regression model thus confirms the initial - bivariate correlation-based - hypothesis that the effect of volatile acidity on the quality is negative.

Model 5: pH value has a highly significant positive influence on the quality ratings. The higher the pH value, the better the quality is perceived.

Model 6: I tested chlorides on top of model 5, but it was neither significant nor did it improve the adjusted R-squared.
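The AIC values in the table tell the same story about chlorides. As a sketch, AIC can be recomputed from the reported log-likelihoods as 2k - 2 logLik, where k counts the estimated coefficients plus the residual variance (the way R's `AIC()` counts parameters for `lm` fits):

```r
# Log-likelihoods of models 5 and 6, copied from the regression table
loglik <- c(m5 = -5580.287, m6 = -5580.047)

# Number of estimated parameters: intercept + predictors + residual variance
k <- c(m5 = 7, m6 = 8)

# AIC = 2k - 2 * logLik; lower is better
aic <- 2 * k - 2 * loglik

print(round(aic, 3))
```

The recomputed values match the table, and model 6 has the higher AIC: the tiny gain in log-likelihood from adding chlorides does not pay for the extra parameter.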

Adjusted R-squared: Models 1 and 2 contribute the largest share of our overall adjusted R-squared, followed by volatile acidity and residual sugar. The overall value of 0.270 means that our model is able to explain about 27% of the variance in quality ratings, which is quite good but also means that some 73% is not explained by the variables we built into our models.
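The contribution of each step can be read directly off the table. The following sketch takes the adjusted R-squared values from the first regression table, computes the gain per added predictor, and cross-checks the model 6 value against the usual adjustment formula 1 - (1 - R^2)(n - 1)/(n - p - 1):

```r
# Adjusted R-squared per model, copied from the first regression table
adj_r2 <- c(m1 = 0.094, m2 = 0.192, m3 = 0.210,
            m4 = 0.263, m5 = 0.270, m6 = 0.270)

# Gain from each added predictor (m1 is measured against 0)
gain <- diff(c(0, adj_r2))

# Cross-check m6 from its plain R-squared: n = 4898 wines, p = 6 predictors
n      <- 4898
p      <- 6
r2_m6  <- 0.271
adj_m6 <- 1 - (1 - r2_m6) * (n - 1) / (n - p - 1)

# Share of variance left unexplained by the largest model
unexplained <- 1 - adj_r2[["m6"]]

print(round(gain, 3))
print(round(c(adj_m6 = adj_m6, unexplained = unexplained), 3))
```

Note that these gains depend on the order in which the predictors were entered; with correlated predictors such as density, alcohol and residual sugar, a different order would distribute the shares differently.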

## 
## Calls:
## m1: lm(formula = I(quality) ~ density, data = ww)
## m2: lm(formula = I(quality) ~ density + alcohol, data = ww)
## m3: lm(formula = I(quality) ~ density + alcohol + log10(residual.sugar), 
##     data = ww)
## m4: lm(formula = I(quality) ~ density + alcohol + log10(residual.sugar) + 
##     log10(volatile.acidity), data = ww)
## m5: lm(formula = I(quality) ~ density + alcohol + log10(residual.sugar) + 
##     log10(volatile.acidity) + pH, data = ww)
## m6: lm(formula = I(quality) ~ density + alcohol + log10(residual.sugar) + 
##     log10(volatile.acidity) + pH + log10(chlorides), data = ww)
## 
## ===============================================================================================================
##                                 m1            m2            m3            m4            m5            m6       
## ---------------------------------------------------------------------------------------------------------------
##   (Intercept)                 96.277***    -22.492***     49.045***     46.080***     54.106***     52.053***  
##                               (4.003)       (6.165)       (9.711)       (9.329)       (9.421)       (9.463)    
##   density                    -90.942***     24.728***    -46.736***    -44.982***    -54.192***    -52.260***  
##                               (4.027)       (6.079)       (9.652)       (9.271)       (9.402)       (9.439)    
##   alcohol                                    0.360***      0.284***      0.311***      0.295***      0.286***  
##                                             (0.015)       (0.017)       (0.016)       (0.016)       (0.017)    
##   log10(residual.sugar)                                    0.465***      0.554***      0.612***      0.599***  
##                                                           (0.049)       (0.047)       (0.048)       (0.049)    
##   log10(volatile.acidity)                                               -1.519***     -1.502***     -1.487***  
##                                                                         (0.075)       (0.075)       (0.075)    
##   pH                                                                                   0.399***      0.393***  
##                                                                                       (0.074)       (0.074)    
##   log10(chlorides)                                                                                  -0.193*    
##                                                                                                     (0.088)    
## ---------------------------------------------------------------------------------------------------------------
##   R-squared                    0.094         0.192         0.207         0.269         0.273         0.274     
##   adj. R-squared               0.094         0.192         0.207         0.268         0.272         0.273     
##   sigma                        0.843         0.796         0.789         0.758         0.756         0.755     
##   F                          509.911       583.290       425.851       449.179       367.191       307.041     
##   p                            0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood           -6111.983     -5831.127     -5786.594     -5588.654     -5574.194     -5571.768     
##   Deviance                  3478.689      3101.773      3045.878      2809.382      2792.843      2790.078     
##   AIC                      12229.967     11670.255     11583.188     11189.307     11162.387     11159.536     
##   BIC                      12249.456     11696.241     11615.671     11228.287     11207.864     11211.509     
##   N                         4898          4898          4898          4898          4898          4898         
## ===============================================================================================================

In this regression, I used some transformed variables. These transformations make effect interpretations more difficult, therefore I will focus on the effect direction and related changes in the adjusted R-squared that improve the overall model fit.
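One way to keep log10-transformed predictors interpretable: the coefficient is the change in predicted quality for a tenfold increase of the predictor, and multiplying it by log10(2) gives the change for a doubling. A sketch using the model 6 coefficients from the table above:

```r
# Coefficients of the log10-transformed predictors in model 6
beta <- c(residual.sugar   =  0.599,
          volatile.acidity = -1.487,
          chlorides        = -0.193)

# Change in predicted quality for a tenfold increase of the predictor
effect_tenfold <- beta

# Change in predicted quality when the predictor doubles
effect_doubling <- beta * log10(2)

print(round(effect_doubling, 3))
```

So, holding the other variables fixed, doubling residual sugar is associated with roughly +0.18 quality points and doubling volatile acidity with roughly -0.45 points.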

Model 1: Density variable is the same, effects same as before.

Model 2: Alcohol variable is the same, effects same as before.

Model 3: Residual sugar is log10 transformed, effect direction the same but overall adjusted R-squared a little less than before. Yet we stick to the transformed variable as the distribution requires transformation.

Model 4: Volatile acidity is log10 transformed, effect direction the same as before: the higher the volatile acidity, the lower the perceived wine quality. The adjusted R-squared slightly improved compared to the untransformed model (0.268 vs. 0.263). We stick to the transformed variable as the distribution requires transformation.

Model 5: pH variable is the same, effect a little smaller than before.

Model 6: Chlorides is log10 transformed, effect is negative and significant. The adjusted R-squared now overall improved to 0.273. We stick to the transformed variable as the distribution requires transformation.

Adjusted R-squared: With the model modifications, we have improved our adjusted R-squared from 0.270 to 0.273, a little improvement compared to before.
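How much weight to give the last step (adding log10(chlorides)) depends on the criterion. The table's AIC drops from 11162.387 (m5) to 11159.536 (m6), but BIC, which penalizes extra parameters more heavily, points the other way. As a sketch, BIC can be recomputed from the reported log-likelihoods as k log(n) - 2 logLik, with k counting coefficients plus the residual variance:

```r
# Log-likelihoods of the transformed models 5 and 6, copied from the table
n_obs  <- 4898
loglik <- c(m5 = -5574.194, m6 = -5571.768)

# Number of estimated parameters: intercept + predictors + residual variance
k <- c(m5 = 7, m6 = 8)

# BIC = k * log(n) - 2 * logLik; lower is better
bic <- k * log(n_obs) - 2 * loglik

print(round(bic, 3))
```

The recomputed values match the table. By BIC, model 5 is slightly preferred, so the improvement from chlorides is small enough that the two criteria disagree.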

Outliers:

Outlier values were not cut off for these variables, to reduce the complexity of the analysis. This data cleaning exercise should be conducted thoroughly with more time and is likely to improve the model performance.
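As a starting point for such a cleaning step, the sketch below applies the common 1.5 x IQR boxplot rule to a few variables, using only the quartiles and maxima from the summary table. The choice of this rule and of these three variables is an assumption for illustration, not something decided in the analysis:

```r
# Quartiles and maxima copied from the summary table
q1      <- c(residual.sugar = 1.7,  chlorides = 0.036, free.sulfur.dioxide = 23)
q3      <- c(residual.sugar = 9.9,  chlorides = 0.050, free.sulfur.dioxide = 46)
max_obs <- c(residual.sugar = 65.8, chlorides = 0.346, free.sulfur.dioxide = 289)

# Upper fence of the 1.5 * IQR rule: values above it count as outliers
iqr         <- q3 - q1
upper_fence <- q3 + 1.5 * iqr

# Does the observed maximum exceed the fence?
exceeds <- max_obs > upper_fence

print(round(upper_fence, 3))
print(exceeds)
```

For all three variables the maximum lies far beyond the fence, in line with the long right tails seen in the univariate plots. On the full data, rows could be flagged with a comparison such as `ww$residual.sugar > upper_fence[["residual.sugar"]]`.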


Final Plots and Summary

White wine quality plot

White wine quality description

The perceived quality of white wine looks rather normally distributed.

We can see here that the majority of white wines have a quality rating between 5 and 7; there are very few counts at the outer boundaries of the histogram.

Correlation matrix plot

Correlation matrix description

We’re focussing on quality, so here are some observations:

  • lower volatile acidity seems to have a slightly positive influence
  • same applies for chlorides and total sulfur dioxide
  • lower density seems to have a rather positive influence on quality
  • higher alcohol seems to have a strongly positive impact on quality

Amongst the others, following observations are interesting:

  • density has a strong negative correlation with alcohol and a strong positive relation with residual sugar, alcohol has a strong negative correlation with residual sugar. These are the strongest effects observed
  • pH value is negatively related to fixed acidity
  • levels of free sulfur dioxide and total sulfur dioxide are positively correlated with each other and positively correlated with density
  • fixed acidity is not related to volatile acidity
  • chlorides are positively correlated with density, but negatively with alcohol

We need to keep in mind the rule of thumb of small relatedness starting with r >= 0.3, which we can show here for some cases. Yet, this also depends on the case number, we have plenty of cases in our dataset.

Relation density/alcohol/residual sugar plot

Relation density/alcohol/residual sugar description

Here, our suspicion of a relation between some of our independent variables is confirmed. We can observe the following based on scatter plots with a linear regression smoother (purple):

  • Density increases with lower alcohol levels
  • Density increases with higher residual sugar levels
  • Alcohol decreases with higher residual sugar levels

The smoother lines confirm the correlation coefficients. This example also confirms that if we want to run a stellar regression analysis of what impacts white wine quality, we need to be aware that the independent variables are related with one another. Without me being the wine expert, it might be the case that the level of residual sugar in a wine influences both density and alcohol and therefore indirectly the quality, even though sugar does not show highly significant relations to wine quality in our correlation matrix.

Regression Output

## 
## Calls:
## m1: lm(formula = I(quality) ~ density, data = ww)
## m2: lm(formula = I(quality) ~ density + alcohol, data = ww)
## m3: lm(formula = I(quality) ~ density + alcohol + log10(residual.sugar), 
##     data = ww)
## m4: lm(formula = I(quality) ~ density + alcohol + log10(residual.sugar) + 
##     log10(volatile.acidity), data = ww)
## m5: lm(formula = I(quality) ~ density + alcohol + log10(residual.sugar) + 
##     log10(volatile.acidity) + pH, data = ww)
## m6: lm(formula = I(quality) ~ density + alcohol + log10(residual.sugar) + 
##     log10(volatile.acidity) + pH + log10(chlorides), data = ww)
## 
## ===============================================================================================================
##                                 m1            m2            m3            m4            m5            m6       
## ---------------------------------------------------------------------------------------------------------------
##   (Intercept)                 96.277***    -22.492***     49.045***     46.080***     54.106***     52.053***  
##                               (4.003)       (6.165)       (9.711)       (9.329)       (9.421)       (9.463)    
##   density                    -90.942***     24.728***    -46.736***    -44.982***    -54.192***    -52.260***  
##                               (4.027)       (6.079)       (9.652)       (9.271)       (9.402)       (9.439)    
##   alcohol                                    0.360***      0.284***      0.311***      0.295***      0.286***  
##                                             (0.015)       (0.017)       (0.016)       (0.016)       (0.017)    
##   log10(residual.sugar)                                    0.465***      0.554***      0.612***      0.599***  
##                                                           (0.049)       (0.047)       (0.048)       (0.049)    
##   log10(volatile.acidity)                                               -1.519***     -1.502***     -1.487***  
##                                                                         (0.075)       (0.075)       (0.075)    
##   pH                                                                                   0.399***      0.393***  
##                                                                                       (0.074)       (0.074)    
##   log10(chlorides)                                                                                  -0.193*    
##                                                                                                     (0.088)    
## ---------------------------------------------------------------------------------------------------------------
##   R-squared                    0.094         0.192         0.207         0.269         0.273         0.274     
##   adj. R-squared               0.094         0.192         0.207         0.268         0.272         0.273     
##   sigma                        0.843         0.796         0.789         0.758         0.756         0.755     
##   F                          509.911       583.290       425.851       449.179       367.191       307.041     
##   p                            0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood           -6111.983     -5831.127     -5786.594     -5588.654     -5574.194     -5571.768     
##   Deviance                  3478.689      3101.773      3045.878      2809.382      2792.843      2790.078     
##   AIC                      12229.967     11670.255     11583.188     11189.307     11162.387     11159.536     
##   BIC                      12249.456     11696.241     11615.671     11228.287     11207.864     11211.509     
##   N                         4898          4898          4898          4898          4898          4898         
## ===============================================================================================================

Several predictors in this regression are log10-transformed. These transformations make the effect sizes harder to interpret, so I will focus on the direction of each effect and on the change in adjusted R-squared as predictors are added.
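
As a rough aid for reading the log10 coefficients: in a linear model, multiplying a log10-transformed predictor by a factor k shifts the predicted quality by the coefficient times log10(k), with the other predictors held fixed. The sketch below only does this arithmetic on coefficients copied from the m6 column of the table; `log10_effect` is a hypothetical helper name, not part of the original analysis.

```r
# Coefficients copied from the m6 column of the regression table
b_sugar <- 0.599    # log10(residual.sugar)
b_vol   <- -1.487   # log10(volatile.acidity)

# Hypothetical helper: expected change in quality when the predictor is
# multiplied by `factor`, other predictors held fixed: beta * log10(factor)
log10_effect <- function(beta, factor) beta * log10(factor)

effects <- c(
  sugar_x10   = log10_effect(b_sugar, 10),  # tenfold residual sugar
  sugar_x2    = log10_effect(b_sugar, 2),   # doubled residual sugar
  volatile_x2 = log10_effect(b_vol, 2)      # doubled volatile acidity
)
print(round(effects, 3))
```

A tenfold increase simply returns the coefficient itself, which is why the raw numbers in the table look larger than the effect of any realistic change in the predictor.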

Model 1: Density has a highly significant negative influence on the quality rating. It can explain about 9.4% of the variance of quality level ratings. The denser the wine, the worse the quality is rated.

Model 2: Alcohol has a highly significant positive influence on the quality rating. Adding it raises R-squared from 0.094 to 0.192, so it explains about a further 9.8% of the variance in quality ratings. Its coefficient is much smaller than that of density, which can be explained by the different measurement scales: density varies only within a very small range, so a one-unit increase in density corresponds to a very large change in predicted quality. Interestingly, density shows a positive relation to quality in model 2, opposite to all other models.

The more alcohol, the better the wine quality is rated.
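
To put the density coefficients on a more realistic scale, the sketch below multiplies them by a small step in density. The coefficients are copied from the table; the step of 0.001 is an assumption chosen purely for illustration (the first rows of the data show densities between 0.9940 and 1.0010, so a full one-unit change never occurs).

```r
# Density coefficients copied from the regression table
b_density <- c(m1 = -90.942, m2 = 24.728, m6 = -52.260)

# Assumed step size in density units, for illustration only
step <- 0.001

# Expected change in quality for that step, other predictors held fixed
scaled <- b_density * step
print(round(scaled, 4))
```

On this scale the density effects are smaller than the alcohol coefficient of 0.360 per unit of alcohol in model 2, which shows that the raw coefficients cannot be compared directly.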

Model 3: Residual sugar also has a highly significant positive influence on the quality ratings. The more residual sugar, the better the wine quality is rated. The comprehensive regression model rejects the initial hypothesis, based on the bivariate correlation, that the effect of residual sugar on quality is negative; once density and alcohol are controlled for, the relation is positive.

Model 4: Volatile acidity has a highly significant negative influence on the quality ratings. The higher the volatile acidity, the worse the wine quality is rated. The comprehensive regression model confirms the initial hypothesis, based on the bivariate correlation, that the effect of volatile acidity on quality is negative.

Model 5: pH value has a highly significant positive influence on the quality ratings. The higher the pH value, the better the quality is perceived.

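Model 6: Chlorides enters the model log10-transformed. Its effect is negative and significant, but it carries only one star in the table, so the evidence is weaker than for the other predictors.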

Adjusted R-squared: With the model modifications, the adjusted R-squared rose from 0.270 to 0.273, only a slight improvement over the previous version.
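
The step-by-step gains can also be read off the fit statistics in the table. The sketch below retypes the adjusted R-squared, AIC and BIC rows and computes the change from each model to the next; it does not refit anything, and the column names are my own.

```r
# Fit statistics copied from the regression table (m1 to m6)
fit <- data.frame(
  model  = paste0("m", 1:6),
  adj_r2 = c(0.094, 0.192, 0.207, 0.268, 0.272, 0.273),
  aic    = c(12229.967, 11670.255, 11583.188, 11189.307, 11162.387, 11159.536),
  bic    = c(12249.456, 11696.241, 11615.671, 11228.287, 11207.864, 11211.509)
)

# Change relative to the previous, smaller model (NA for m1)
fit$d_adj_r2 <- c(NA, diff(fit$adj_r2))
fit$d_aic    <- c(NA, diff(fit$aic))
fit$d_bic    <- c(NA, diff(fit$bic))

print(fit[, c("model", "d_adj_r2", "d_aic", "d_bic")])
```

In these numbers, adding log10(chlorides) in m6 lowers the AIC by about 2.9 but raises the BIC by about 3.6, so the two criteria disagree on whether the last predictor is worth keeping.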


Reflection

The univariate analysis uncovered insights about the distribution of the data that could be used later in the regression model, in particular which variables to transform. In theory, these transformations could also have been applied in the bivariate assessments, but that would have made the interpretation more difficult. As I wanted to focus the interpretation on the regression models, I applied the transformations in the last stage of the project.

Bivariate and multivariate analyses led to insights that make it easier to understand which factors contribute to a high perceived white wine quality.

To reduce the complexity of the analysis, outliers were not removed for the transformed variables. With more time, this data cleaning should be carried out thoroughly; it is likely to improve the model performance.
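
One possible way to do this cleaning is quantile-based trimming. The sketch below is only an illustration under my own assumptions: `trim_outliers` is a hypothetical helper, the 1% and 99% cut-offs are arbitrary, and the toy data frame uses made-up values rather than the wine data.

```r
# Hypothetical helper: keep rows whose value lies inside the chosen quantiles
trim_outliers <- function(df, column, lower = 0.01, upper = 0.99) {
  bounds <- quantile(df[[column]], probs = c(lower, upper), na.rm = TRUE)
  x <- df[[column]]
  keep <- !is.na(x) & x >= bounds[[1]] & x <= bounds[[2]]
  df[keep, , drop = FALSE]
}

# Toy data standing in for the wine data frame (values are made up)
set.seed(1)
toy <- data.frame(
  residual.sugar = c(rlnorm(200, meanlog = 1.5, sdlog = 0.8), 60)
)

trimmed <- trim_outliers(toy, "residual.sugar")
print(nrow(toy) - nrow(trimmed))  # number of rows dropped
```

With the real data, the same call could be applied to each transformed variable before refitting the models, and the fit statistics compared with those in the table.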

The log10 version of residual sugar has improved the model quality, but judging by the distribution of the variable, some other transformation might give better results.

As a next step, the approach could be applied to the red wine data, and the results shared and discussed with my local wine dealer. For wine producers, these insights can serve as a key to bringing their portfolio wines closer to their customers' tastes.

A nice add-on to the dataset would be the price. A common hypothesis for wine goes: the more expensive, the higher the perceived quality. This has been disproven in many lab-scale tests I’ve seen on TV, but it would be interesting to see how price relates to quality in our sample of several thousand white and red wines.